
Conversation

@amogh-jahagirdar
Contributor

Co-authored-by: Mathew Fournier [email protected]

Refer to #10445 for more details.

This change defines a RepairManifests Action interface. Implementations should produce a new set of manifests in which duplicate entries are removed and entries that refer to files which no longer exist in storage are also removed.
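For reference, pieced together from the snippets quoted in the review below, the proposed action has roughly the following shape. This is only a sketch (the exact method set changed during review), and the Javadoc and method names simply echo the fragments discussed in this thread.

    import org.apache.iceberg.actions.SnapshotUpdate;

    /**
     * An action that will repair manifests. Implementations should produce a new set of manifests
     * where duplicate entries are removed and entries which refer to files which no longer exist
     * are removed.
     */
    public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> {
      /** Configuration method for recomputing per-entry stats from the underlying data files. */
      RepairManifests repairEntryStats();

      /** Configuration method for removing duplicate file entries and entries for files missing from storage. */
      RepairManifests repairFileEntries();

      /** Report what would be repaired without committing any changes. */
      RepairManifests dryRun();

      interface Result {
        /** Returns the number of duplicate files removed */
        long duplicateFilesRemoved();

        /** Returns the number of manifest entries for which stats were incorrect */
        long entryStatsIncorrect();

        /** Returns the number of manifest entries for which stats were corrected */
        long entryStatsRepairedCount();
      }
    }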

Comment on lines 24 to 25
* An action that will repair manifests. Implementations should produce a new set of manifests where
* duplicate entries are removed and entries which refer to files which no longer exist are removed.
Contributor Author
@amogh-jahagirdar, Jul 25, 2024

The interface is prescriptive about what implementations should do with duplicate entries and with entries that refer to files which no longer exist. After some thought I think it's right to define that, but I can see arguments where someone wants implementations to just fail if there are entries which refer to files which no longer exist (note: I'm specifically talking about cases where implementations cannot recover those files). To me, though, that can just be an additional validation API someone adds down the line.

Also, I think at some level the interface needs to be prescriptive to have a clearly defined Result.

Member

So I think we may be missing a few other things here, and I'm not sure if they should all happen on every invocation. In my mind:

  1. Read metadata from data files and recompute values in manifests
  2. Remove manifests or file entries that hit FileNotFound (FNF)
  3. Remove duplicates

I think we probably also want to be able to do a dry run of this? Something that checks these 3 things but doesn't actually commit any changes?

Member

NVM (dry run is in the API below)

Member

Thanks @amogh-jahagirdar for this. Yeah, Russell's list sounds good; optionally, I guess we can put the options in later when we implement them.

Contributor Author

@RussellSpitzer @szehon-ho I largely wanted to avoid a user having to think through which options they need to specify, and to make sure implementations have sane defaults, but I think I agree that at least at the actions API level we could have separate configuration methods. When the procedure implementation is there, users could pass in what they want and the procedure implementation would have some sane defaults (power users who use the raw actions APIs would have to set every option they want explicitly). cc @danielcweeks in case he has any thoughts.

Iterable<ManifestFile> addedManifests();

/** Returns the number of duplicate files removed */
long duplicateFilesRemoved();
Contributor Author

Would it be more useful if this was an Iterable of the actual duplicate files? Same with the missing files below?

TBH, it didn't seem useful to me because someone who is running repair probably wants to know whether the table was corrected and to what extent it was corrected, rather than the individual files... but let me know what you think @szehon-ho @RussellSpitzer @danielcweeks

Member

I kind of feel like we do want the iterable here but the procedure could narrow this to the count in the result there

Member

I don't have a strong preference either way, but if we don't expect millions, it makes sense to have an Iterable?

Contributor Author

I agree, I think we can just put the Iterable here. Hopefully in the average repair case, it shouldn't really be millions of duplicate files?

*
* @return this for method chaining
*/
RepairManifests dryRun();
Contributor

Should this rather be part of the actual Spark procedure than part of the API? I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option as part of the procedure, but not as part of the DeleteOrphanFiles API.

Member

Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?

Contributor Author

Shouldn't this be part of "apply()" which should already exist because this inherits PendingUpdate?

@RussellSpitzer this inherits SnapshotUpdate in the Actions API, there's no apply() in that.

I think you're thinking of the other SnapshotUpdate that's part of the "normal" API. This brings up a good point of whether this should also have an interface there. RewriteManifests is defined in the regular API and then there's also an action for that.

The purpose of the Actions API is to enable engines to provide distributed/other custom implementations of procedures, and I think here we could just start by having it as an action and if there's a desire to add it to the typical API we could do it at that time?

Should this rather be part of the actual Spark procedure than part of the API? I'm not sure how one would examine the results of a dry run in this case. For example, RemoveOrphanFilesProcedure has a dry run option as part of the procedure, but not as part of the DeleteOrphanFiles API.

Good point, I missed that this is what orphan file removal does. If I think through it a bit more: by defining dryRun() at the interface level, what we're saying is that implementations must have a way of supporting that (the implementation could technically ignore the option, but that would be a bad implementation of the interface).

If we look at what the orphan file action does, the action interface has a "deleteWith" API and the procedure passes in a no-op deletion function.

What we could do is have an action API "commitRepairWith()" which would accept some sort of commit function to perform the manifest rewrite or, if the procedure is a dry run, some no-op commit.
It feels like that complicates the API more.

I think the real fix is to return Result as part of the dryRun...let me double check if this is feasible.

Contributor Author

Actually, I think it's good as it is @nastra. If a user wants to get the results, in the current API they'd do:

    RepairManifests.Result result =
        actionsProvider.repairManifests(table)
            .repairFileEntries()
            .repairEntryStats()
            .dryRun()
            .execute();
    // Do whatever with result

* An action that will repair manifests. Implementations should produce a new set of manifests where
* duplicate entries are removed and entries which refer to files which no longer exist are removed.
*/
public interface RepairManifests extends SnapshotUpdate<RepairManifests, RepairManifests.Result> {
Contributor

should this maybe extend Action like the other actions?

Contributor Author

I think SnapshotUpdate is the correct one to extend; here's my thinking:

  1. SnapshotUpdate already extends Action
  2. I think implementations will necessarily produce snapshots when they perform repairs (putting aside the dry run API option). The new snapshot that would be produced, at least in my mind in the reference implementation, would be essentially a "rewrite manifests"-like snapshot.

Contributor
@nastra, Aug 6, 2024

Ah nvm, this seems to be correct. I thought it was extending org.apache.iceberg.SnapshotUpdate while it's actually extending org.apache.iceberg.actions.SnapshotUpdate.

Member
@szehon-ho left a comment

Thanks @amogh-jahagirdar for pushing this forward!


@amogh-jahagirdar force-pushed the repair-manifests-api branch 2 times, most recently from 4b2ddfe to 83ca336 on August 1, 2024 18:20

Comment on lines +30 to +33
RepairManifests repairEntryStats();

/**
* Configuration method for removing duplicate file entries and removing files which no longer
* exist in storage
*/
RepairManifests repairFileEntries();
Contributor Author

Based on the previous discussion @szehon-ho @RussellSpitzer, the API now has 2 configuration methods, for repairing file entries and for repairing entry stats. The first one encompasses both the duplicate and missing case (I don't think there's a need to differentiate those options).

Member

Sounds good. For completeness, I wonder if we need to distinguish what to repair, i.e., file_size_in_bytes vs. metrics, etc. We can just read the filesystem metadata and save reading the footer, for example. Probably it is too fine grained, and this option is fine?

Contributor

I feel like if we're going through the process of hitting the file (to check size or read footer), we should just repair everything. I can't see a case where you would want to rewrite the information but possibly carry over incorrect information.
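(For context, a rough sketch of the two levels of checking discussed above, using the core Java API. The class and method names are illustrative and this is not the PR's implementation; the full recompute assumes Parquet data files.)

    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.Metrics;
    import org.apache.iceberg.MetricsConfig;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.InputFile;
    import org.apache.iceberg.parquet.ParquetUtil;

    class EntryStatsCheckSketch {
      // Cheap check: does the recorded file size match the length reported by storage?
      static boolean sizeMismatch(Table table, DataFile dataFile) {
        InputFile input = table.io().newInputFile(dataFile.path().toString());
        return input.getLength() != dataFile.fileSizeInBytes();
      }

      // Full recompute: re-derive column metrics from the file footer.
      static Metrics recomputedMetrics(Table table, DataFile dataFile) {
        InputFile input = table.io().newInputFile(dataFile.path().toString());
        return ParquetUtil.fileMetrics(input, MetricsConfig.forTable(table));
      }
    }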

RepairManifests repairEntryStats();

/**
* Configuration method for removing duplicate file entries and removing files which no longer
Member

Typo?


*/
RepairManifests dryRun();

interface Result {
Contributor

The one thing I feel is missing here is stats on how many entries were fixed for stats. Do we just assume everything was corrected? From a dry run result, it's unclear whether running the operation would result in any change when there are no duplicate/missing/recovered files but stats still need to be fixed.

Unfortunately, I would expect that fixed stats could be a very large number of files, which wouldn't work with the Iterable model used for the other categories of issues.

Contributor Author

Good point, I don't think we should assume everything was corrected, since it involves a footer read and some computation, and in that flow things can go wrong for an individual file but not for another.

I've added two result methods for this:

    /** Returns the number of manifest entries for which stats were incorrect */
    long entryStatsIncorrect();

    /** Returns the number of manifest entries for which stats were corrected */
    long entryStatsRepairedCount();

I think it's important to distinguish between how many were repaired and how many were incorrect to begin with. With only the repair count, if it's 0 a user wouldn't be able to distinguish "was everything actually OK with the stats" from "a bunch of entry stats are incorrect but we couldn't actually do the repair".
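(A small usage sketch of how these two counts might be read together; repairAction is a hypothetical, already-configured action instance and not part of the proposed API.)

    RepairManifests.Result result = repairAction.execute();
    long incorrect = result.entryStatsIncorrect();
    long repaired = result.entryStatsRepairedCount();
    if (incorrect == 0) {
      // stats were already correct
    } else if (repaired < incorrect) {
      // some incorrect entries could not be repaired
    }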

Contributor Author
@amogh-jahagirdar, Aug 2, 2024

I've also removed "added manifests" because, as I was working through the implementation a bit more, I came to the conclusion that it's not needed. I think it was just a copy from the rewrite manifests results. My thinking is that for repair we're always working with existing manifests and producing new manifests with rewritten contents, so there isn't really an "added" manifest in the sense of how the rewrite procedure defines it. cc @danielcweeks @szehon-ho @RussellSpitzer. It's also something we could add later on with a default implementation if we realize we need it.



@szehon-ho
Member

Hi @amogh-jahagirdar @RussellSpitzer, I had a question. I think one motivation was to handle the CheckSnapshotIntegrity API proposed in #10642.

However, in that PR it is also checking metadata.json files and manifest lists for missing metadata-file locations.

I think there are other cases it may be worth repairing, for example snapshot summary metadata can be repaired if it is ever out of sync (as the values are mostly updated incrementally and one bad update can cascade to subsequent snapshots).

What do you guys think about a generic RepairTable, with some analogous methods like repairSnapshotSummaries or repairSnapshots?

Otherwise we may still need #10642 to do those checks.

@szehon-ho
Member

Another one is #10535, a corrupt manifest file. It may make sense to repair those as well. So in summary:

  • repairSnapshotMetadata (read snapshot file and remove or repair snapshot metadata)
  • repairManifestMetadata (read manifest file and remove or repair manifest metadata)
  • repairFileMetadata (read data/delete file and remove or repair file metadata)

Let me know what you think, and whether it's within the scope here.

@amogh-jahagirdar
Contributor Author

@szehon-ho I think I agree that the use cases presented are legitimate repair cases; however, I'm not sure that we should be combining all of them into a single RepairTable procedure upfront. I understand the appeal, but my thinking is that the implementation gets complex. Users will typically hit some issue, like a missing file referenced in their manifests or duplicate files, and then just want to solve that issue with a focused procedure; or, in the missing metadata file, missing manifest list, missing manifest, or corrupted snapshot cases, run a separate procedure.

That said, one thing to consider is whether we need these procedures to move in lockstep somehow to ensure that statistics/snapshot summaries roll up correctly. I don't think that's the case, since in most cases we are producing entirely new snapshots after the repair and the deltas in rows/files added/deleted should be accumulated correctly? We'll probably need tests for that.

I do think that we could add a RepairTable down the line which combines multiple smaller repair actions, so that the implementation of RepairTable itself isn't super complex. It comes at the cost of having more surface area in the actions API, but I feel like that's OK?

More concretely at least what I'm thinking so far:

  1. RepairManifests which is focused on looking at the table's current manifests and removing duplicates/attempting recovery of missing files or deleting those entries if those files cannot be recovered
  2. RepairTableMetadata which would do the historical metadata file recovery, manifest list repairs and snapshots?

@szehon-ho I actually had a question on the snapshot repair, based on the description the goal of that is to repair snapshot summary stats which may have been corrupted. Doesn't that necessarily mean we must mutate the existing snapshot (and subsequent snapshots) to correct it?

Overall though, I'll definitely give the general RepairTable more thought; it just feels a bit complex to put everything in one go up front rather than have focused actions.

@amogh-jahagirdar amogh-jahagirdar requested a review from rdblue August 8, 2024 04:51
@szehon-ho
Member

Makes sense @amogh-jahagirdar !

@szehon-ho I actually had a question on the snapshot repair, based on the description the goal of that is to repair snapshot summary stats which may have been corrupted. Doesn't that necessarily mean we must mutate the existing snapshot (and subsequent snapshots) to correct it?

I was just thinking we'd make a new snapshot with correct summary stats (only the totals). But yes, you are right, it is open to interpretation; in this case you can't go back and fix wrong stats, so this particular feature probably does need more thought.

@amogh-jahagirdar
Contributor Author

Makes sense @amogh-jahagirdar !

@szehon-ho I actually had a question on the snapshot repair, based on the description the goal of that is to repair snapshot summary stats which may have been corrupted. Doesn't that necessarily mean we must mutate the existing snapshot (and subsequent snapshots) to correct it?

I was just thinking we'd make a new snapshot with correct summary stats (only the totals). But yes, you are right, it is open to interpretation; in this case you can't go back and fix wrong stats, so this particular feature probably does need more thought.

Really sorry for the delayed response @szehon-ho, I forgot I had this open. So I was discussing this with @rdblue and I am more convinced of your point that we may as well have a unified RepairTable action with different configuration methods. The compelling argument for me, at least, is that it follows existing API patterns; for example, ManageSnapshots has different options but serves as a useful entry point. RepairTable is a useful entry point for a user, there can be sane defaults on what to repair, and power users can specify which specific operations they want to run if they know what's broken. Another example of this came up in https://github.com/apache/iceberg/pull/10755/files#r1696094932 for removing unused specs, where we were thinking of having an entry-point maintenance API for both removing unused specs and schemas. So from an API consistency perspective, I think it's good to have the same pattern for this action.

Furthermore, the procedure traverses the entire metadata tree, so practically there's probably overlap across the different repair operations.

I can update the PR based on this.
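(A rough sketch of what such a unified entry point could look like; the grouping and method names simply echo the discussion above and are not a committed API. Result methods other than those already quoted in this thread are illustrative.)

    import org.apache.iceberg.actions.Action;

    // Hypothetical shape only, following the ManageSnapshots-style "entry point" pattern.
    public interface RepairTable extends Action<RepairTable, RepairTable.Result> {
      RepairTable repairFileEntries();      // dedupe entries, drop/recover entries for missing files
      RepairTable repairEntryStats();       // recompute sizes/metrics from the data files
      RepairTable repairSnapshotMetadata(); // fix snapshot summary totals, per the list above
      RepairTable dryRun();                 // report what would change without committing

      interface Result {
        long duplicateFilesRemoved();
        long entryStatsIncorrect();
        long entryStatsRepairedCount();
      }
    }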

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Nov 10, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Nov 17, 2024
@danielcweeks danielcweeks reopened this Jan 28, 2025
@github-actions github-actions bot removed the stale label Jan 29, 2025
@wypoon
Contributor

wypoon commented Jan 30, 2025

For my edification, can someone please explain how duplicate file entries in manifests can arise? Can two entries for the same file occur in a single manifest? Can even two manifests be in the same manifest list if they overlap (have an entry for the same file in common)? I'd have thought both of these situations would be bugs. Or are there actual sequences of operations that lead to such outcomes, similar to how dangling deletes can occur?
Basically, I'm trying to understand what repairs might be needed due to operations in Iceberg (and which of these are bugs), and what are needed only due to users doing something outside of Iceberg. Deleting a file from storage falls into the latter category.
Also, I understand that there was an old bug where data file size was written incorrectly and this actually caused reads to fail, and this is the motivation for correcting the statistics in metadata. However, that bug was long fixed, so I wonder if there are still known situations where these statistics need to be corrected.

@amogh-jahagirdar
Contributor Author

amogh-jahagirdar commented Jan 30, 2025

For my edification, can someone please explain how duplicate file entries in manifests can arise? Can two entries for the same file occur in a single manifest? Can even two manifests be in the same manifest list if they overlap (have an entry for the same file in common)? I'd have thought both of these situations would be bugs. Or are there actual sequences of operations that lead to such outcomes, similar to how dangling deletes can occur?

Sure, so one example of an issue that happened in the past is that the Kafka Connect integration ended up appending the same file multiple times. We rectified that in Kafka Connect and in the Iceberg library for duplicate appends in the same snapshot, but it's still technically possible to append the same file across different snapshots (at least in the reference implementation and probably a few others).

Detecting overlapping files requires reading through the current manifest(s) to deduplicate, which would be prohibitively expensive if performed on every append.

Overlapping files across manifests does imply a bug in whatever integration is writing to the table. However, even after those bugs are fixed, it still makes sense for Iceberg to expose repair procedures to correct those tables and unblock users from using them.
More generally, imo it makes sense to offer a general RepairTable procedure with different options which enable users to correct their tables as best as possible, in case a bad implementation ended up writing to them.

Can even two manifests be in the same manifest list if they overlap (have an entry for the same file in common)?

That'd also be a bug. I haven't seen any issues on that but I wouldn't be opposed to the repair table procedure handling that (probably not the biggest concern imo). It's relatively easy to detect and dedupe that since it's high in the metadata tree.

Also, I understand that there was an old bug where data file size was written incorrectly and this actually caused reads to fail, and this is the motivation for correcting the statistics in metadata. However, that bug was long fixed, so I wonder if there are still known situations where these statistics need to be corrected.

Yeah, I think the same thing I mentioned above applies: some random implementation may end up writing incorrect statistics for whatever reason, and it'd be good for repair table to correct that, since stats are something that can be deterministically corrected.
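(As a rough illustration of the kind of duplicate-entry check being discussed, a sketch using the core Java API; this is not the PR's implementation, and it only covers the data manifests of the current snapshot. The class and method names are illustrative.)

    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.iceberg.DataFile;
    import org.apache.iceberg.ManifestFile;
    import org.apache.iceberg.ManifestFiles;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.io.CloseableIterable;

    class DuplicateEntryCheck {
      // Returns the data file paths referenced more than once in the current snapshot's data manifests.
      static Set<String> duplicatePaths(Table table) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new HashSet<>();
        for (ManifestFile manifest : table.currentSnapshot().dataManifests(table.io())) {
          try (CloseableIterable<DataFile> entries = ManifestFiles.read(manifest, table.io())) {
            for (DataFile file : entries) {
              String path = file.path().toString();
              if (!seen.add(path)) {
                duplicates.add(path);
              }
            }
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        }
        return duplicates;
      }
    }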

@wypoon
Contributor

wypoon commented Jan 30, 2025

Another question I have about duplicate entries (in the manifests) is: does their presence make the table unreadable? Or is the table still readable and in a valid state, although an undesirable one (as is the case with dangling deletes, for example)?

An observation is that the existing subinterfaces of org.apache.iceberg.actions.SnapshotUpdate are all actions that do not change the state of the table (the data remains the same). This is not the case with removing data (or delete) files from the metadata because those files are missing from the storage. In that case, the state of the table is changed; either data is deleted or added (due to "undeletion" from removing delete files) or both. For this reason, I think that removing files from metadata is logically different from "repair" operations such as deduplication of entries or correcting statistics for entries.

@flyrain
Contributor

flyrain commented Feb 22, 2025

Do we still pursue this?

@danielcweeks
Contributor

Do we still pursue this?

Yes, @amogh-jahagirdar is still looking at this and we continue to see use cases where this is necessary.

@flyrain
Contributor

flyrain commented Feb 24, 2025

@danielcweeks thanks for the confirmation. One of the use cases is to check the integrity of a table copied to a different place. The use cases are increasing now that rewriteTablePath is out with release 1.8. People will need a tool to validate before they feel confident that the migration is done.

@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Mar 27, 2025
@github-actions

github-actions bot commented Apr 3, 2025

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Apr 3, 2025
@mdub

mdub commented Jul 4, 2025

I see this PR has been closed, which is sad. Should it be re-opened, @amogh-jahagirdar, @danielcweeks?

I'd love to see repairManifests integrated, as it was a life-saver when we experienced corruption in our Iceberg tables earlier this year.

@scottedwards2000

@danielcweeks thanks for the confirmation. One of the use cases is to check the integrity of a table copied to a different place. The use cases are increasing now that rewriteTablePath is out with release 1.8. People will need a tool to validate before they feel confident that the migration is done.

As an enthusiastic user of the new rewriteTablePath, I would find this use case compelling. I didn't actually realize that was within the scope of this PR since the name implies a focus on "repair". Any ETA on when we could use it for this migration confirmation?

@mrcnc
Contributor

mrcnc commented Aug 6, 2025

Another question I have about duplicate entries (in the manifests) is: does their presence make the table unreadable? Or is the table still readable and in a valid state, although an undesirable one (as is the case with dangling deletes, for example)?

I believe this depends on the reader. According to these docs, it appears Snowflake will not read tables that have duplicates in the manifest files.

@amogh-jahagirdar
Contributor Author

Ok, this has been open for quite some time, and I haven't been able to follow through with it due to bandwidth, but now I'm going to make time to do so since I've received multiple reach-outs on the usefulness of this procedure. From what I recall, the pending decision and update had to do with making this a generic RepairTable Action API and exposing specific configuration options for selecting which operations to run.

I will update this week when I get a chance; in the meantime, folks who have tables with duplicate manifests can build and run from https://github.com/amogh-jahagirdar/iceberg/tree/repair-manifests-action
